ONS / NISR
2021
Web scraping (Screen Scraping, Web Data Extraction, Web Harvesting etc.) is a technique used to automatically extract large amounts of data from websites and save it to a file or database.
The Internet is a store of the world's information - be it text, media or data in any other format. Every web page displays data in one form or another. Access to this data is crucial for the success of most businesses in the modern world. Unfortunately, much of this data is not open: most websites do not provide an option to save the data they display to your local storage or to reuse it elsewhere.
Web Scraping fills this gap by automating data collection. This matters not just for businesses: data collection and analysis are important for government, non-profit and educational institutions too.
The following are few of the many common applications of Web Scraping:
In eCommerce, Web Scraping is used for competitor price monitoring.
In Marketing, Web Scraping is used for lead generation, to build phone and email lists for cold outreach.
In Real Estate, Web Scraping is used to get property and agent/owner details.
Web Scraping is used to collect training and testing data for Machine Learning projects.
One of the first questions that comes to mind once you have decided to scrape data is whether web scraping is legal. Scraping data which is already available in the public domain is legal as long as you use the data ethically.
Whilst the process of web scraping is legal, consideration should be given to the data that you're attempting to collect. Whilst it may be in the public domain, you may not have a legal standing to collect personal or copyrighted data.
Personal Data - As a rule of thumb, it is recommended to have a lawful reason to obtain, store and use personal data without the user’s consent.
Copyrighted Data - It is not illegal to scrape copyrighted data as long as you don’t plan to reuse or publish it.
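Beyond the legal questions, most sites publish a robots.txt file stating which paths they ask automated clients to avoid. As a minimal sketch (the rules below are illustrative, not from any real site), Python's standard library can check a URL against such rules:

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only - a real site's rules would be fetched from
# https://<site>/robots.txt (e.g. with rp.set_url(...) followed by rp.read()).
rp = RobotFileParser()
rp.parse([
    'User-agent: *',
    'Disallow: /private/',
])

print(rp.can_fetch('*', 'https://example.com/private/page'))  # False
print(rp.can_fetch('*', 'https://example.com/public/page'))   # True
```

Checking robots.txt is a courtesy rather than a legal requirement in most jurisdictions, but respecting it is good scraping etiquette.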
The backbone of any web page is HTML. This is a relatively simple markup language that uses <tags>, denoted by angle brackets, to mark up different elements.
Open https://www.statistics.gov.rw in any web browser and right click on the page and select View Source.
As HTML is just a series of <tags> written in plain text, we can create a web page that can be rendered in any browser just using a text editor.
Create a new file called my_webpage.html and add the following text.
<html> <!-- Open the HTML tag to declare that everything inside is HTML -->
<body> <!-- Open the body tag, this is where we can write visible elements -->
<h1>Page title</h1> <!-- h1 stands for Heading, see the use of </> to close the tag -->
<p>This is my webpage.</p> <!-- p stands for paragraph -->
</body> <!-- Close the body tag -->
</html> <!-- Close the HTML tag-->
There are plenty of other <tags> we can use in HTML, a full list can be found here
Some common ones you'll see are listed below
| Tag | Usage |
|---|---|
| <div> | Used to group elements together, or to provide structure to the web page |
| <span> | Used to group elements and to provide structure; behaves slightly differently to <div> |
| <img> | Adds an image to the web page |
| <table>, <th>, <tr>, <td> | Defines a table in HTML, with the sub-elements defining the table header, table row and table cell respectively |
| <a> | Creates a hyperlink around a specific element |
| <b>, <i> | Create bold and italic elements respectively |
| <ol>, <ul>, <li> | Create ordered and unordered lists, where <li> tags are list items |
Let's create a second web page called my_complex_webpage.html that incorporates some of these other HTML elements.
<html>
<body>
<h1>My Complex Webpage</h1>
<p>This is my more complex webpage with additional elements</p>
<a href="https://www.statistics.gov.rw">This is a link to https://www.statistics.gov.rw</a>
<p>Below here is the NISR logo</p>
<img src="https://www.statistics.gov.rw/sites/default/files/images/logo.png">
<h2>This is an unordered list of fruits</h2>
<ul>
<li>Apple</li>
<li>Banana</li>
</ul>
<h2>This is a HTML table</h2>
<table>
<tr><th>Column 1</th><th>Column2</th><th>Column3</th></tr>
<tr><td>1</td><td>2</td><td>3</td></tr>
<tr><td>4</td><td>5</td><td>6</td></tr>
<tr><td>7</td><td>8</td><td>9</td></tr>
</table>
</body>
</html>
HTML is good for structure, but it isn't very useful for styling elements on a web page. That's where Cascading Style Sheets (CSS) comes in. CSS is a separate language that allows us to apply "styles" to elements on our HTML web page.
For example if we wanted to set the background of our web page to black and the font colour to white we could use the following CSS code.
/* The body tells the browser to only apply the contained styles onto the <body> element */
body {
background: black; /* Set the page background to black */
color: white; /* Set the page font colour to white */
}
Save the above code as style.css
There are two ways to add CSS to our web page. We can add it directly into the HTML document using the <style> tags. More commonly you'll see CSS stored in a separate .css file which is linked in the .html file using the <head> and <link> tags.
The <head> tag is like the body tag, but is used to store additional meta information that isn't directly displayed on the page.
<html>
<head>
<link rel='stylesheet' href='style.css'>
</head>
<body>
...
</body>
</html>
Create a copy of my_complex_webpage.html and add the <head> and <link> tags as described above.
CSS is able to define styles not just for types of elements (e.g. <body>, <li>, <p>); it can also define classes that can be applied to numerous elements.
/* The "." at the start of the definition tells the browser to apply this style to any elements */
/* that have the specified class name. */
.red_text {
color: red;
}
Add this style to your style.css file.
We can now use the class attribute on any HTML element to assign this specific style to specific elements.
<html>
<head>
<link rel='stylesheet' href='style.css'>
</head>
<body>
<h1>My Complex Webpage</h1>
<p class="red_text">This is my more complex webpage with additional elements</p>
...
<h2 class="red_text">This is an unordered list of fruits</h2>
<ul>
<li class="red_text">Apple</li>
<li>Banana</li>
</ul>
...
</body>
</html>
Edit your copy of my_complex_webpage.html to include the class attribute on some tags.
Pandas has a built-in function called read_html that allows us to read HTML tables directly from a web page. We can try this with the web page we just created, using the following code.
import pandas as pd
df = pd.read_html('./my_complex_webpage.html')
df
Column 1 Column2 Column3 0 1 2 3 1 4 5 6 2 7 8 9
Pandas correctly found our table, parsing out all the other HTML. Note that by default read_html returns a list of all the tables pandas can find on the web page, even if there is only one.
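The list behaviour can be seen with a small self-contained example (the HTML string below is a trimmed stand-in for my_complex_webpage.html):

```python
import pandas as pd
from io import StringIO

# Trimmed stand-in for my_complex_webpage.html
html = """
<html><body>
<table>
  <tr><th>Column 1</th><th>Column2</th><th>Column3</th></tr>
  <tr><td>1</td><td>2</td><td>3</td></tr>
  <tr><td>4</td><td>5</td><td>6</td></tr>
  <tr><td>7</td><td>8</td><td>9</td></tr>
</table>
</body></html>
"""

tables = pd.read_html(StringIO(html))
print(type(tables))   # a list, even though there is only one table
print(len(tables))    # 1
df = tables[0]        # index in to get the actual DataFrame
```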
Pandas is also able to filter out any of the CSS that's been applied to our tables, returning only the data.
import pandas as pd
# Select the first / only dataframe in the list
df_no_css = pd.read_html('./my_complex_webpage.html')[0]
df_css = pd.read_html('./my_complex_webpage_with_css.html')[0]
# This will error if the dataframes aren't identical.
pd.testing.assert_frame_equal(df_no_css, df_css)
Real-world websites are of course much messier than our example page, so we will also need to employ some basic data cleaning techniques to deal with them.
Let's look at the Wikipedia page for the Rwandan Men's National Basketball Team. There are lots of different tables, in different styles, some with images, some with complex headers. We can throw the URL directly into read_html and see what comes out.
import pandas as pd
basketball_tables = pd.read_html('https://en.wikipedia.org/wiki/Rwanda_men%27s_national_basketball_team')
print(f'Tables found: {len(basketball_tables)}')
Tables found: 13
Often web developers will use <table> tags as a structural element, rather than to explicitly display some data. Note that the 0th index in basketball_tables doesn't refer to the first visible table, but instead to the information card in the top right corner of the page.
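When a page contains many tables, one hedged way to work out which index is which is to print a short summary of every parsed table. The two stand-in tables below imitate a structural/layout table followed by a data table, to show that index 0 isn't always the table you want:

```python
import pandas as pd
from io import StringIO

# Stand-ins: a layout table (like the information card) followed by a data table
html = StringIO("""
<table><tr><td>Rwanda men's national basketball team</td></tr></table>
<table>
  <tr><th>Pos.</th><th>Name</th></tr>
  <tr><td>PG</td><td>Jean Nshobozwabyose</td></tr>
</table>
""")

tables = pd.read_html(html)
for i, tbl in enumerate(tables):
    # Print the index, shape and column names so each table can be identified
    print(i, tbl.shape, list(tbl.columns))
```

The same loop run over basketball_tables would let you identify the roster table by eye.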
Looking through the 13 parsed tables, we can find the current roster table at position 4, but as Wikipedia can change, we want to write code that always selects the roster table. We can do that using the match keyword argument, which restricts the result to tables containing text that matches the string (or regular expression) passed.
Once we've done that we can add our usual skiprows and header arguments to make sure the correct row is being used as the header of the table.
url = 'https://en.wikipedia.org/wiki/Rwanda_men%27s_national_basketball_team'
roster = pd.read_html(url,
match="Rwanda men's national basketball team roster",
skiprows=1,
header=2)[0]
roster.head()
| | Pos. | No. | Name | Age – Date of birth | Height | Club | Ctr. |
|---|---|---|---|---|---|---|---|
| 0 | PG | 4 | Jean Nshobozwabyose | 23 – | 1.83 m (6 ft 0 in) | Patriots | NaN |
| 1 | G | 5 | Ntore Habimana | 24 – | 1.96 m (6 ft 5 in) | Wilfrid Laurier Golden Hawks | NaN |
| 2 | SG | 6 | Steven Hagumintwari | 27 – | 1.93 m (6 ft 4 in) | Patriots | NaN |
| 3 | SG | 7 | Armel Sangwe | 24 – | 1.90 m (6 ft 3 in) | Espoir | NaN |
| 4 | SG | 8 | Emile Kazeneza | 20 – | 2.01 m (6 ft 7 in) | William Carey University | NaN |
We now have code that can scrape that table whenever we want. However, something looks a little wrong with the Age – Date of birth column: not all the data has been scraped, notably the actual dates of birth.
This is because there is hidden data within these cells. By default pandas only scrapes the displayed text, skipping hidden elements, unless we explicitly tell it not to using displayed_only=False.
url = 'https://en.wikipedia.org/wiki/Rwanda_men%27s_national_basketball_team'
roster = pd.read_html(url,
match="Rwanda men's national basketball team roster",
skiprows=1,
header=2,
displayed_only=False)[0]
roster.head()
| | Pos. | No. | Name | Age – Date of birth | Height | Club | Ctr. |
|---|---|---|---|---|---|---|---|
| 0 | PG | 4 | Jean Nshobozwabyose | 23 – (1998-06-26)26 June 1998 | 1.83 m (6 ft 0 in) | Patriots | NaN |
| 1 | G | 5 | Ntore Habimana | 24 – (1997-08-15)15 August 1997 | 1.96 m (6 ft 5 in) | Wilfrid Laurier Golden Hawks | NaN |
| 2 | SG | 6 | Steven Hagumintwari | 27 – (1993-10-01)1 October 1993 | 1.93 m (6 ft 4 in) | Patriots | NaN |
| 3 | SG | 7 | Armel Sangwe | 24 – (1997-04-15)15 April 1997 | 1.90 m (6 ft 3 in) | Espoir | NaN |
| 4 | SG | 8 | Emile Kazeneza | 20 – (2000-08-30)30 August 2000 | 2.01 m (6 ft 7 in) | William Carey University | NaN |
There we go, now we've got all the data we want from the table. Unfortunately, as Wikipedia has used images rather than text to represent the countries of the players, we're unable to scrape them using pandas.
We'll look at other methods to get this data later.
We've seen some of the limitations already, notably that pandas cannot parse images and that it collects tables that aren't relevant to our goal. More importantly, on most web pages the data we want to scrape won't be formatted into a nice table for us. If it isn't in <table> tags then we won't be able to scrape it using pandas.
There are lots of other methods for accessing that data, but first we need to understand a little about how websites function.
Now that we understand the structure of a web page, we can see how it might be extremely tedious to create every individual web page, especially if we want to include regularly changing data.
That's why most web pages are created dynamically. This means that the web page is put together on-the-fly whenever someone requests to see it.
Web pages are usually generated in one of two ways: via client-side scripting or via server-side scripting. This defines where the data gets turned into HTML elements. If it is on the client side, then the raw data is sent directly to our browser and our computer builds the web page; if it is server-side, then we never see the raw data, only the computed HTML elements.
| Client-side Scripting | Server-side scripting |
|---|---|
| Data usually processed with Javascript | Data can be processed with php, Javascript, python etc. |
| Is possible to see the underlying data | Is not possible to see the underlying data |
We've already looked at a web page's source by using View page source. There is a more advanced tool for working with web pages built into most browsers, usually called Inspect (Right-Click > Inspect). Let's inspect the Wikipedia page for the Rwandan Men's National Basketball Team.
We'll come back to the Elements tab later; for now we want to look at the Network tab on the toolbar.
The network tab records all the requests that go between our browser and the server (as well as other servers) in the production of the web page. When you first open the page it will be blank. Refreshing the browser page will cause the network tab to record all the different requests that occur.
Clicking on any one of the requested files, you can see the full HTTP request (more on this later) as well as a preview and the full response from the server for that request. Looking at the response for the first request (the page itself), we can see that the data was included directly in the page as HTML. This implies that this particular page was processed server-side. Another clue was the reference to PHP, which is an exclusively server-side language.
Let's look at the NBA website instead. This page shows us statistics for the regular season for players in the NBA, ordered by the number of points.
We could try to scrape this data using pandas, but let's see if we can find the source of this data first. Opening up the Inspect tool, we can look at the Network tab to try and find where this data is loaded from.
There are a lot of files loaded as part of this web page. We can reduce the number we need to search through by using the built-in filters on the Network tab. Let's look at Fetch/XHR, which filters the list to requests usually associated with data.
Looking through this shorter list of files, one stands out as potentially containing the data that we want to extract from the web page.
We can click on the file and then the Response tab to see what information is sent to our browser from this file. Looking at the response, we can see that the data that goes into our table is not encoded as HTML, so we can be relatively sure that the web page is generated at least partly on the client side.
The requests library is the de-facto standard for making HTTP requests in Python; it abstracts away all of the complexity we just saw using the Inspect tool. Note that requests is not part of the standard library, but it comes pre-installed with most Python distributions (e.g. Anaconda) and can otherwise be installed with pip.
The requests library is very powerful, but importantly we can use it to do in Python what our web browser was doing when it loaded in our data.
Returning to our NBA example, the Network tab shows us all of the HTTP requests that have been made in the process of creating the web page that we see.
If we look in the Headers tab, we can see the form that this HTTP request took.
The URL had the request information encoded into it, we can also see that the request type is GET.
Let's see what happens if we recreate that request in Python using the requests library. First we need to get the request URL from the Headers tab. We also need to note that the method is GET.
import requests
url = 'https://stats.nba.com/stats/leagueLeaders?LeagueID=00&PerMode=PerGame&Scope=S&Season=2021-22&SeasonType=Regular+Season&StatCategory=PTS'
# We are using the .get method to match the GET HTTP request
# we also include the .json() method to return to us the response
# from the request as a python dictionary.
response = requests.get(url).json()
print(response)
We can see that the result of that command is the same as the data we saw using the Inspect tool. We can look through this nested dictionary object to try to understand the structure of the response. It is important to note that not every response will look the same; you'll need to dig into each response to work out how to extract the data.
We can look through the object and see if there is a way to convert information into a table that we can use.
print(response.keys())
print(response['resultSet'].keys())
dict_keys(['resource', 'parameters', 'resultSet']) dict_keys(['name', 'headers', 'rowSet'])
Looking at the keys in the data, we can see that the response contains three objects called resource, parameters and resultSet. resource and parameters are metadata about the table that we've just requested. resultSet contains another dictionary with the keys name, headers and rowSet. rowSet is a list of lists, each representing a row of data, and headers contains a list of column names.
We can put these together using pandas into a dataframe very easily.
import requests
import pandas as pd
url = 'https://stats.nba.com/stats/leagueLeaders?LeagueID=00&PerMode=PerGame&Scope=S&Season=2021-22&SeasonType=Regular+Season&StatCategory=PTS'
response = requests.get(url).json()
table_headers = response['resultSet']['headers']
table_data = response['resultSet']['rowSet']
df = pd.DataFrame(table_data, columns=table_headers)
df
| | PLAYER_ID | RANK | PLAYER | TEAM | GP | MIN | FGM | FGA | FG_PCT | FG3M | ... | FT_PCT | OREB | DREB | REB | AST | STL | BLK | TOV | PTS | EFF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 201142 | 1 | Kevin Durant | BKN | 12 | 34.4 | 11.2 | 19.1 | 0.585 | 1.9 | ... | 0.829 | 0.5 | 8.0 | 8.5 | 5.0 | 0.6 | 0.7 | 3.5 | 29.5 | 31.8 |
| 1 | 201939 | 2 | Stephen Curry | GSW | 11 | 33.6 | 8.6 | 19.9 | 0.434 | 5.0 | ... | 0.949 | 0.8 | 5.6 | 6.5 | 6.5 | 1.6 | 0.6 | 3.1 | 27.4 | 28.0 |
| 2 | 202331 | 3 | Paul George | LAC | 11 | 35.3 | 10.0 | 21.9 | 0.456 | 3.2 | ... | 0.867 | 0.5 | 7.3 | 7.8 | 5.4 | 2.5 | 0.5 | 4.5 | 26.7 | 26.0 |
| 3 | 203507 | 4 | Giannis Antetokounmpo | MIL | 12 | 32.9 | 9.5 | 19.2 | 0.496 | 1.3 | ... | 0.688 | 1.9 | 9.9 | 11.8 | 6.0 | 1.1 | 1.8 | 3.0 | 26.6 | 31.8 |
| 4 | 1629630 | 5 | Ja Morant | MEM | 11 | 35.3 | 10.0 | 20.6 | 0.485 | 1.7 | ... | 0.779 | 1.3 | 4.5 | 5.7 | 7.3 | 1.7 | 0.3 | 4.0 | 26.5 | 25.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 274 | 1629216 | 275 | Gabe Vincent | MIA | 10 | 8.9 | 0.8 | 2.0 | 0.400 | 0.1 | ... | 1.000 | 0.3 | 0.5 | 0.8 | 1.7 | 0.2 | 0.0 | 0.6 | 1.9 | 2.8 |
| 275 | 203085 | 276 | Austin Rivers | DEN | 9 | 12.4 | 0.8 | 3.0 | 0.259 | 0.2 | ... | 0.500 | 0.4 | 0.7 | 1.1 | 0.8 | 0.3 | 0.1 | 0.7 | 1.9 | 1.2 |
| 276 | 1630541 | 276 | Moses Moody | GSW | 9 | 6.8 | 0.8 | 2.0 | 0.389 | 0.2 | ... | 0.500 | 0.0 | 0.9 | 0.9 | 0.3 | 0.0 | 0.1 | 0.1 | 1.9 | 1.8 |
| 277 | 1626161 | 278 | Willie Cauley-Stein | DAL | 11 | 10.0 | 0.7 | 1.7 | 0.421 | 0.0 | ... | 0.000 | 0.7 | 1.6 | 2.4 | 0.5 | 0.3 | 0.0 | 0.2 | 1.5 | 3.4 |
| 278 | 1630215 | 279 | Jared Butler | UTA | 10 | 4.6 | 0.5 | 1.9 | 0.263 | 0.2 | ... | 0.500 | 0.0 | 0.6 | 0.6 | 0.6 | 0.0 | 0.5 | 0.6 | 1.4 | 0.9 |
279 rows × 24 columns
If we look closer at the URL, we can see it encodes a lot of arguments; these look very similar to the filters that are available on the web page.
https://stats.nba.com/stats/leagueLeaders?
LeagueID=00&
PerMode=PerGame&
Scope=S&
Season=2021-22&
SeasonType=Regular+Season&
StatCategory=PTS
If we change "PerGame" to "Totals" and re-run our code, then we should get the data that would populate the website's table had we selected that option. What we've done here is discover the API that sits behind the NBA website, and we can exploit this to extract data.
| | PLAYER_ID | RANK | PLAYER | TEAM | GP | MIN | FGM | FGA | FG_PCT | FG3M | ... | REB | AST | STL | BLK | TOV | PF | PTS | EFF | AST_TOV | STL_TOV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 201142 | 1 | Kevin Durant | BKN | 12 | 413 | 134 | 229 | 0.585 | 23 | ... | 102 | 60 | 7 | 8 | 42 | 16 | 354 | 381 | 1.43 | 0.17 |
| 1 | 203507 | 2 | Giannis Antetokounmpo | MIL | 12 | 395 | 114 | 230 | 0.496 | 16 | ... | 142 | 72 | 13 | 21 | 36 | 36 | 319 | 381 | 2.00 | 0.36 |
| 2 | 201939 | 3 | Stephen Curry | GSW | 11 | 370 | 95 | 219 | 0.434 | 55 | ... | 71 | 72 | 18 | 7 | 34 | 17 | 301 | 308 | 2.12 | 0.53 |
| 3 | 202331 | 4 | Paul George | LAC | 11 | 388 | 110 | 241 | 0.456 | 35 | ... | 86 | 59 | 28 | 5 | 49 | 31 | 294 | 286 | 1.20 | 0.57 |
| 4 | 1629630 | 5 | Ja Morant | MEM | 11 | 388 | 110 | 227 | 0.485 | 19 | ... | 63 | 80 | 19 | 3 | 44 | 16 | 292 | 281 | 1.82 | 0.43 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 443 | 1630536 | 418 | Sharife Cooper | ATL | 2 | 7 | 0 | 3 | 0.000 | 0 | ... | 0 | 2 | 0 | 0 | 1 | 0 | 0 | -2 | 2.00 | 0.00 |
| 444 | 1629605 | 418 | Tacko Fall | CLE | 3 | 3 | 0 | 1 | 0.000 | 0 | ... | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0.00 | 0.00 |
| 445 | 1628962 | 418 | Udoka Azubuike | UTA | 2 | 2 | 0 | 0 | 0.000 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 |
| 446 | 1630176 | 418 | Vernon Carey Jr. | CHA | 1 | 1 | 0 | 1 | 0.000 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 |
| 447 | 1627782 | 418 | Wayne Selden | NYK | 1 | 1 | 0 | 0 | 0.000 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 |
448 rows × 27 columns
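Rather than editing the query string by hand, we can let requests encode these arguments from a dictionary. As a sketch, the request below is prepared but not sent, so the encoded URL can be inspected offline; requests.get(base_url, params=params) would send it for real:

```python
import requests

base_url = 'https://stats.nba.com/stats/leagueLeaders'
params = {
    'LeagueID': '00',
    'PerMode': 'Totals',        # changed from 'PerGame'
    'Scope': 'S',
    'Season': '2021-22',
    'SeasonType': 'Regular Season',
    'StatCategory': 'PTS',
}

# Prepare the request without sending it, to inspect the encoded URL
prepared = requests.Request('GET', base_url, params=params).prepare()
print(prepared.url)
```

Keeping the arguments in a dictionary makes it easy to loop over seasons or stat categories when scraping this API systematically.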
Sometimes, in fact most of the time, the information that we want to scrape won't be found neatly formatted into a table. What we need is a way to extract the relevant information programmatically from non-table elements. Enter beautifulsoup, an HTML parsing library for Python that lets us pull the relevant information out of a web page with an easy-to-use syntax.
beautifulsoup does not come as part of the standard Python installation, so we need to pip install it. We can do this inside our Jupyter notebook using
!pip install beautifulsoup4
Or just on the command line by running the same command, without the ! at the beginning of the line.
Requirement already satisfied: beautifulsoup4 in /opt/miniconda3/lib/python3.9/site-packages (4.10.0) Requirement already satisfied: soupsieve>1.2 in /opt/miniconda3/lib/python3.9/site-packages (from beautifulsoup4) (2.3)
Once we've installed beautifulsoup we can start to use it to parse our HTML data. Let's start by parsing the web page that we made earlier.
from bs4 import BeautifulSoup
with open('./my_complex_webpage.html', 'r') as f:
soup = BeautifulSoup(f, 'html.parser')
print(soup)
<html> <head> <link href="style.css" rel="stylesheet"/> </head> <body> <h1>My Complex Webpage</h1> <p class="red_text">This is my more complex webpage with additional elements</p> <a href="https://www.statistics.gov.rw">This is a link to https://www.statistics.gov.rw</a> <p>Below here is the NISR lo ...
beautifulsoup has lots of functions that make it very easy to extract information from an HTML page, the most useful of which is the find_all() method. Full documentation for the find_all method can be found here.
Earlier we were able to use pandas to extract the HTML table very easily, but what if we are more interested in the unordered list of fruits? We can use the find_all function to retrieve all of the list item <li> tags.
soup.find_all('li')
[<li class="red_text">Apple</li>, <li>Banana</li>]
We've successfully extracted all of the <li> tags, but our data still isn't very clean. We're not interested in the HTML tag, just the data contained within. We can deal with this by using beautifulsoup to strip out our HTML tags.
# We can do this with a loop
for tag in soup.find_all('li'):
print(tag.get_text())
# Or by using a list comprehension
[tag.get_text() for tag in soup.find_all('li')]
Apple Banana
['Apple', 'Banana']
Success! However, it is common that not all the information we want to extract shares the same <tag>, or that lots of irrelevant information has the same <tag>. Fortunately, when people design web pages they tend to give similar information the same visual appearance. We know that visual appearance is controlled by CSS, and using beautifulsoup we can extract data by CSS class!
for red_text in soup.find_all(class_="red_text"):
print(red_text)
[red_text.get_text() for red_text in soup.find_all(class_='red_text')]
<p class="red_text">This is my more complex webpage with additional elements</p> <h2 class="red_text">This is an unordered list of fruits</h2> <li class="red_text">Apple</li>
['This is my more complex webpage with additional elements', 'This is an unordered list of fruits', 'Apple']
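As an aside, beautifulsoup also understands CSS selector syntax directly through the select() method, which can be a compact alternative to find_all(class_=...). A sketch using an inline copy of the relevant elements:

```python
from bs4 import BeautifulSoup

# Inline copy of the styled elements from my_complex_webpage.html
html = '''
<body>
  <p class="red_text">This is my more complex webpage with additional elements</p>
  <h2 class="red_text">This is an unordered list of fruits</h2>
  <ul><li class="red_text">Apple</li><li>Banana</li></ul>
</body>
'''
soup = BeautifulSoup(html, 'html.parser')

# '.red_text' is a CSS class selector, exactly as it appears in style.css
print([el.get_text() for el in soup.select('.red_text')])
```

select() accepts most CSS selectors, so more specific queries like 'ul li.red_text' also work.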
Let's go back to our Wikipedia example. Recall that we were able to use pandas to scrape the roster table, but weren't able to get the country information because it wasn't stored as plain text. We can use beautifulsoup to parse out that information with much finer control.
First we need to get the HTML that generates that Wikipedia page. We can do this using our trusty requests library.
import requests
URL = 'https://en.wikipedia.org/wiki/Rwanda_men%27s_national_basketball_team'
wiki_page = requests.get(URL)
print(wiki_page)
<Response [200]>
Oh, this is just a response code, not the HTML that we were expecting. Fortunately, Response [200] means that the request executed successfully. In order to get the HTML we need to use the .text attribute.
import requests
URL = 'https://en.wikipedia.org/wiki/Rwanda_men%27s_national_basketball_team'
wiki_page = requests.get(URL).text
print(wiki_page)
<!DOCTYPE html> <html class="client-nojs" lang="en" dir="ltr"> <head> <meta charset="UTF-8"/> <title>Rwanda men's national basketball team - Wikipedia</title> <script>document.documentElement.classNam ...
Now we can parse this HTML code with beautifulsoup. As we're only interested in the roster table, we can tell beautifulsoup to filter out all the HTML that isn't related to it.
soup = BeautifulSoup(wiki_page, 'html.parser')
tables = soup.find_all('table')
print(f'Found {len(tables)} tables.\n')
# Filter the list of tables to just those that contain a country
# column called Ctr.
country_tables = [tbl for tbl in tables if 'Ctr.' in str(tbl)]
# This is more complex HTML than usual - there is a table inside a table - so we
# select the second country_table, which represents the inner table.
roster_html = country_tables[1]
print(roster_html)
Found 13 tables. <table class="sortable" style="background:transparent; margin:0px; width:100%;"> <tbody><tr> <th><abbr title="Position(s)">Pos.</abbr></th> <th><abbr title="Number">No.</abbr></th> <th>Name</th> <th>Age – <small>Date of birth</small></th> <th>Height</th> <th>Club</th> <th><abbr title="Country">Ctr.< ...
Try parsing the roster_html BeautifulSoup object into a pandas dataframe.
# Use a list comprehension to look for all the <th> tags, for each
# one, get the text and strip the result. These are the column headers
# for the table.
header = [col.get_text().strip() for col in roster_html.find_all('th')]
# Create an empty list to store our processed rows.
rows = []
# Loop over all of the <tr> tags; each one corresponds to a
# row in our table.
for tr in roster_html.find_all('tr')[1:]:
# Create an empty row variable where we can store all of our processed
# data
row = []
# Loop over all of the <td> tags inside the current <tr> tag. These are
# going to be our data items.
for data in tr.find_all('td'):
# If the data item isn't blank (or just a newline character)
# then add it to our row, stripping out the excess whitespace
if data.get_text() != '\n':
row.append(data.get_text().strip())
# If there is an <img> tag in the <td> tag then we're on our
# flag column. We want to extract the country information.
# We could extract this from the image, but all the images are
# wrapped in an <a> hyperlink tag to that country, which will be
# easier to clean.
if data.find('img') is not None:
# Get the <a> hyperlink tag
img = data.find('a')
# Add the href attribute (this is the link address) to our row
row.append(img['href'])
# Finally add the row into our list of rows.
rows.append(row)
# Construct a dataframe from our list of rows and our header data
df = pd.DataFrame(rows, columns=header)
df
| | Pos. | No. | Name | Age – Date of birth | Height | Club | Ctr. |
|---|---|---|---|---|---|---|---|
| 0 | PG | 4 | Jean Nshobozwabyose | 23 – (1998-06-26)26 June 1998 | 1.83 m (6 ft 0 in) | Patriots | /wiki/Rwanda |
| 1 | G | 5 | Ntore Habimana | 24 – (1997-08-15)15 August 1997 | 1.96 m (6 ft 5 in) | Wilfrid Laurier Golden Hawks | /wiki/Canada |
| 2 | SG | 6 | Steven Hagumintwari | 27 – (1993-10-01)1 October 1993 | 1.93 m (6 ft 4 in) | Patriots | /wiki/Rwanda |
| 3 | SG | 7 | Armel Sangwe | 24 – (1997-04-15)15 April 1997 | 1.90 m (6 ft 3 in) | Espoir | /wiki/Rwanda |
| 4 | SG | 8 | Emile Kazeneza | 20 – (2000-08-30)30 August 2000 | 2.01 m (6 ft 7 in) | William Carey University | /wiki/United_States |
| 5 | SG | 9 | Dieudonné Ndizeye | 24 – (1996-10-14)14 October 1996 | 1.98 m (6 ft 6 in) | Patriots | /wiki/Rwanda |
| 6 | PF | 10 | Olivier Shyaka | 26 – (1995-08-14)14 August 1995 | 2.00 m (6 ft 7 in) | REG | /wiki/Rwanda |
| 7 | F | 11 | Alex Mpoyo | 24 – (1997-01-05)5 January 1997 | 2.01 m (6 ft 7 in) | Trepça | /wiki/Kosovo |
| 8 | SG | 12 | Kenny Gasana | 36 – (1984-11-09)9 November 1984 | 1.90 m (6 ft 3 in) | Patriots | /wiki/Rwanda |
| 9 | C | 13 | Elie Kaje | 26 – (1995-03-17)17 March 1995 | 1.90 m (6 ft 3 in) | Patriots | /wiki/Rwanda |
| 10 | C | 16 | Prince Ibeh | 27 – (1994-06-03)3 June 1994 | 2.06 m (6 ft 9 in) | Patriots | /wiki/Rwanda |
| 11 | SF | 17 | William Robeyns | 25 – (1996-02-23)23 February 1996 | 1.91 m (6 ft 3 in) | Phoenix Brussels | /wiki/Belgium |
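The Ctr. column now holds wiki paths rather than country names. As a final cleaning sketch (using a stand-in Series, since on the real dataframe these values would live in df['Ctr.']), we can strip the /wiki/ prefix and the underscores:

```python
import pandas as pd

# Stand-in for the scraped column; on the real dataframe this would be df['Ctr.']
ctr = pd.Series(['/wiki/Rwanda', '/wiki/United_States', '/wiki/Canada'])

# Remove the '/wiki/' prefix and turn underscores back into spaces
countries = (ctr.str.replace('/wiki/', '', regex=False)
                .str.replace('_', ' ', regex=False))
print(countries.tolist())  # ['Rwanda', 'United States', 'Canada']
```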
Let's look at less tabular data. This is the fiba.basketball news page. We can see there is a list of news articles, each with a headline, date and a small blurb. To start, let's inspect one of the items and see if there is anything common that we can hook onto.
All the additional news articles are in <div> tags with the class related_row.
<div class="related_row">
<a href="http://www.fiba.basketball/afrobasket/2021/qualifiers/news/enabu-rwahwire-reflect-on-uganda-tenacity-ahead-of-morocco-cracker">
<div class="related_top right">
<div class="date_highlighted">07/07/2021</div>
<div class="category" style="background-color: #000000;">News</div>
<h6>Enabu, Rwahwire reflect on Uganda tenacity ahead of Morocco cracker</h6>
</div>
<div class="related_image left adaptive_image" data-adaptive-image-breakpoints="{ default: '/images.fiba.com/Graphic/2/F/7/8/8iqS79S7ukebtkZoyPWr0Q.jpg?v=20210113123220303', 480: '/images.fiba.com/Graphic/F/3/0/5/b8Zt49T0X0CHF0MVS2z46Q.jpg?v=20210113123217152' }" data-adaptive-image-extra-attrs="{ alt: '5 Jimmy Enabu (UGA)' }">
</div>
<div class="related_bottom right">
<p>SALE (Morocco) - Qualifying for the FIBA AfroBasket is a lifetime dream for many and for Uganda captain Jimmy Enabu, it is a befitting reward to a diligent servant of the game back home. </p>
</div>
</a>
</div>
We can see that all the information we want is stored inside this <div> tag with the class related_row. There is a <div> inside with the class date_highlighted that contains the date, one with the class category that contains the article category information. We can see the title of the article is wrapped in header <h6> tags and the blurb of the article is the only <p> tag within the <div>.
Using all this we can write a very simple loop to go through all of the related_row objects and pull out the pertinent information using the exact same methods we've already used.
Try parsing the FIBA news page into a pandas dataframe.
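One possible solution sketch, using a trimmed inline copy of the related_row snippet above so it runs standalone (on the real page, the HTML would come from requests.get(...).text and there would be one related_row per article):

```python
from bs4 import BeautifulSoup
import pandas as pd

# Trimmed copy of one related_row block from the page above
html = '''
<div class="related_row">
  <a href="#">
    <div class="date_highlighted">07/07/2021</div>
    <div class="category">News</div>
    <h6>Enabu, Rwahwire reflect on Uganda tenacity ahead of Morocco cracker</h6>
    <p>SALE (Morocco) - Qualifying for the FIBA AfroBasket is a lifetime dream for many.</p>
  </a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

rows = []
# One related_row <div> per article; pull each field out by class or tag
for article in soup.find_all('div', class_='related_row'):
    rows.append({
        'date': article.find(class_='date_highlighted').get_text(strip=True),
        'headline': article.find('h6').get_text(strip=True),
        'category': article.find(class_='category').get_text(strip=True),
        'blurb': article.find('p').get_text(strip=True),
    })

df = pd.DataFrame(rows)
print(df)
```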
| | date | headline | category | blurb |
|---|---|---|---|---|
| 0 | 07/07/2021 | Enabu, Rwahwire reflect on Uganda tenacity ahe... | News | SALE (Morocco) - Qualifying for the FIBA AfroB... |
| 1 | 11/03/2021 | Madagascar's Botou wants to keep the AfroBaske... | News | ANTANANARIVO (Madagascar) – Despite having mis... |
| 2 | 10/03/2021 | 10 standout performers from the last window of... | News | ABIDJAN - As we look back at the Second Round ... |
| 3 | 08/03/2021 | Guinea celebrate second straight AfroBasket ap... | News | Guinea left Cameroon at the end of the FIBA Af... |
| 4 | 02/03/2021 | Decisions concerning the February window of th... | News | FIBA has taken decisions regarding Equatorial ... |
| 5 | 26/02/2021 | Impressive operational efforts in FIBA Contine... | News | MIES (Switzerland) - Another successful window... |
| 6 | 24/02/2021 | "Senegal have some room for improvement," says... | News | DAKAR (Senegal) - Senegal finished top of Grou... |
| 7 | 23/02/2021 | History Makers Kenya's confidence is sky-high | News | By beating eleven-time Africa champions Angola... |
| 8 | 21/02/2021 | Four teams undefeated at the end of AfroBasket... | Review | MONASTIR/YAOUNDE (Tunisia/Cameroon) - The 20-t... |
| 9 | 21/02/2021 | Top performers at Day 3 in Yaounde | News | YAOUNDE (Cameroon) - There was great frenzy at... |
| 10 | 21/02/2021 | Top performers as curtains fall on FIBA AfroBa... | News | MONASTIR (Tunisia) - With prestige and honor a... |
| 11 | 21/02/2021 | Aristide Mouaha from mop boy to Cameroon inter... | News | YAOUNDE (Cameroon) - The game was in the third... |
| 12 | 21/02/2021 | Ongwae's buzzer-beater shocks Angola as Kenya ... | Game Report | YAOUNDE (Cameroon) On what was Kenya's biggest... |
| 13 | 20/02/2021 | Three tickets still available for AfroBasket 2021 | Review | On the day that Kenya caused the biggest upset... |
| 14 | 20/02/2021 | Romdhane show as Dokossi rises to summit | News | MONASTIR (Tunisia) - Of newbies and veterans e... |
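One possible approach, sketched here on a toy snippet with the same related_row structure (the HTML string and its contents are made up for illustration; on the real page the soup would come from the downloaded source):

```python
from bs4 import BeautifulSoup
import pandas as pd

# Toy HTML mirroring one FIBA related_row; the real page contains many such rows
html = """
<div class="related_row">
  <a href="#">
    <div class="related_top right">
      <div class="date_highlighted">07/07/2021</div>
      <div class="category">News</div>
      <h6>Example headline</h6>
    </div>
    <div class="related_bottom right">
      <p>Example blurb.</p>
    </div>
  </a>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

rows = []
for article in soup.find_all('div', class_='related_row'):
    rows.append({
        'date': article.find('div', class_='date_highlighted').text,
        'headline': article.find('h6').text,
        'category': article.find('div', class_='category').text,
        'blurb': article.find('p').text.strip(),
    })

df = pd.DataFrame(rows)
print(df)
```

Each dictionary becomes one row of the dataframe, so running the same loop over the full page source produces the table above.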
Let's look at one final example piece of HTML. In this one we're going to use some basic JavaScript to add some elements to the page.
<html>
<head>
<script type='text/javascript'>
window.onload = function(){
    for(var i = 0; i < 5; i++){
        var paragraph = document.createElement('p');
        paragraph.innerHTML = 'This is paragraph ' + i;
        document.body.appendChild(paragraph);
    }
}
</script>
</head>
<body>
</body>
</html>
If we open this .html file in a web browser we get what we'd expect: 5 paragraph elements labeled 0 to 4.
But when we try our usual approach of opening this file in BeautifulSoup, we get the following result.
from bs4 import BeautifulSoup

with open('./dynamic_javascript.html', 'r') as f:
    html = f.read()

soup = BeautifulSoup(html, 'html.parser')
soup.find_all('p')
[]
This is because, for the <p> tags to appear on the page, some process needs to execute the JavaScript that generates them. BeautifulSoup is just an HTML parser; it isn't able to execute JavaScript stored in the .html file.
This is a difficult problem to deal with. It would be useful if we could access the results of the .html file as it is rendered inside our browser - enter Selenium.
Selenium is a browser automation tool primarily used for testing websites, but it can be put to a whole host of other tasks. The Selenium library allows us to control a web browser from Python and interact with the results.
Installing Selenium is a little trickier than most Python packages: in addition to the Python library, we need a separate driver program that can communicate with both the Selenium library and our web browser.
First we can pip install the selenium library
!pip install selenium
Then we need to go to https://sites.google.com/chromium.org/driver/ and download the latest stable release of ChromeDriver. Now we need to make sure our Selenium library can talk to the web driver; to do this we add it to our system PATH, the list of directories our computer searches for programs.
For example, on Windows we can copy chromedriver.exe into a folder such as C:\bin and add that folder to our PATH:
setx PATH "%PATH%;C:\bin"
We can then check the installation from a new terminal window:
chromedriver --version
Once Selenium is installed we can import and use it just like any other package; the syntax of Selenium is very similar to the packages we've looked at so far. In order to start a Selenium browser session we have to specify the type of browser we're planning to use. As we installed ChromeDriver, we can do this with the following code.
from selenium import webdriver
driver = webdriver.Chrome()
You'll notice that running this code opens up a new browser window with the message Chrome is being controlled by automated test software. This is the browser that python is going to control.
Be careful running this cell multiple times: every time webdriver.Chrome() is called it will start a new browser but won't close the old one. You can close a session explicitly with driver.quit().
Now that we have our web driver running, we can tell it to navigate to various pages using the driver.get() method. For example, if we wanted to open the Rwanda Men's basketball Wikipedia page we would use:
driver.get('https://en.wikipedia.org/wiki/Rwanda_men%27s_national_basketball_team')
Similarly, if we wanted to open our dynamic JavaScript page, we just need to tell the driver to navigate there. Opening a local file is a little different: first, we need to use file:// rather than http://, and we also need to enter the full filepath, which we can build from os.getcwd().
import os
local_file = 'file://' + os.getcwd() + '/dynamic_javascript.html'
driver.get(local_file)
Once we've navigated our browser to the right page, there are several methods we can use to extract data from the processed HTML. Nearly all of them come in a find_element and a find_elements version, returning the first matching element and a list of all matching elements respectively.
| Driver Method | Usage |
|---|---|
| find_element_by_id | Select element by the id attribute |
| find_elements_by_name | Select elements by the name attribute |
| find_elements_by_xpath | Select elements by an XML path |
| find_elements_by_link_text | Select elements with specific hyperlink text |
| find_elements_by_partial_link_text | Select elements matching part of hyperlink text |
| find_elements_by_tag_name | Select elements by tag name / tag type |
| find_elements_by_class_name | Select elements with the same class |
| find_elements_by_css_selector | Select elements by CSS selectors |
We can use the find_elements_by_tag_name method to collect our dynamically generated <p> tags. What we get back is a list of WebElement objects. We can use the .text property to retrieve the text inside the tag, or use .get_attribute('outerHTML') to extract the full tag as a string.
import os
from selenium import webdriver

driver = webdriver.Chrome()
local_file = 'file://' + os.getcwd() + '/dynamic_javascript.html'
driver.get(local_file)

p_tags = driver.find_elements_by_tag_name('p')
for tag in p_tags:
    print(type(tag))
    print(tag.get_attribute('outerHTML'))
    print(tag.text)
<class 'selenium.webdriver.remote.webelement.WebElement'>
<p>This is paragraph 0</p>
This is paragraph 0
<class 'selenium.webdriver.remote.webelement.WebElement'>
<p>This is paragraph 1</p>
This is paragraph 1
<class 'selenium.webdriver.remote.webelement.WebElement'>
<p>This is paragraph 2</p>
This is paragraph 2
<class 'selenium.webdriver.remote.webelement.WebElement'>
<p>This is paragraph 3</p>
This is paragraph 3
<class 'selenium.webdriver.remote.webelement.WebElement'>
<p>This is paragraph 4</p>
This is paragraph 4
In some cases it may be more useful to use Selenium to generate the page, but then parse the resulting HTML using BeautifulSoup. Fortunately Selenium allows us to access the full HTML of the page including all of the generated elements.
import os
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
local_file = 'file://' + os.getcwd() + '/dynamic_javascript.html'
driver.get(local_file)

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
soup.find_all('p')
[<p>This is paragraph 0</p>, <p>This is paragraph 1</p>, <p>This is paragraph 2</p>, <p>This is paragraph 3</p>, <p>This is paragraph 4</p>]
Let's look at the fiba.basketball news page. If we scroll to the bottom of the page we can see there is a button that says Show More News. This button dynamically loads more news onto the page we're currently viewing.
We wouldn't be able to get this using requests alone, but maybe Selenium can help. First we need to tell Selenium that we want to click that button, but before we can click it we need to find it.
Using the Inspect tool we can see that the button has the class show_more_button so we can use that and a selenium class selector to isolate the element.
Once we've done that, we can use the built-in click method for WebElements to simulate clicking the Show More News button.
from selenium import webdriver
driver = webdriver.Chrome()
URL = 'https://www.fiba.basketball/afrobasket/2021/qualifiers/news'
driver.get(URL)
button = driver.find_element_by_class_name('show_more_button')
button.click()
Now, if we use the same code we used previously to scrape this information, swapping out the requests call for the Selenium one, we can parse even more news than we did previously.
Note: clicking the button doesn't generate the new news items instantly; it takes a moment for the browser to collect them. We need to add a wait using time.sleep to let the page load before we scrape the data.
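A fixed time.sleep works, but guessing the right delay is fragile. A more robust pattern is to poll until the new elements have appeared, up to some timeout. A minimal stdlib-only sketch of that idea (wait_until and enough_news_loaded are illustrative names, not part of Selenium; in practice the check function would re-count the news rows on the page):

```python
import time

def wait_until(check_fn, timeout=10, interval=0.5):
    """Call check_fn repeatedly until it returns True or the timeout elapses."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if check_fn():
            return True
        time.sleep(interval)
    return False

# Toy usage: the counter stands in for "number of news rows on the page",
# which only grows a moment after the button is clicked
state = {'checks': 0}

def enough_news_loaded():
    state['checks'] += 1
    return state['checks'] >= 3

assert wait_until(enough_news_loaded, timeout=2, interval=0.01)
```

Selenium also ships its own helper for this, WebDriverWait, which pairs a timeout with an expected condition.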
|   | date | headline | category | blurb |
|---|---|---|---|---|
| 0 | 07/07/2021 | Enabu, Rwahwire reflect on Uganda tenacity ahe... | News | SALE (Morocco) - Qualifying for the FIBA AfroB... |
| 1 | 11/03/2021 | Madagascar's Botou wants to keep the AfroBaske... | News | ANTANANARIVO (Madagascar) – Despite having mis... |
| 2 | 10/03/2021 | 10 standout performers from the last window of... | News | ABIDJAN - As we look back at the Second Round ... |
| 3 | 08/03/2021 | Guinea celebrate second straight AfroBasket ap... | News | Guinea left Cameroon at the end of the FIBA Af... |
| 4 | 02/03/2021 | Decisions concerning the February window of th... | News | FIBA has taken decisions regarding Equatorial ... |
| 5 | 26/02/2021 | Impressive operational efforts in FIBA Contine... | News | MIES (Switzerland) - Another successful window... |
| 6 | 24/02/2021 | "Senegal have some room for improvement," says... | News | DAKAR (Senegal) - Senegal finished top of Grou... |
| 7 | 23/02/2021 | History Makers Kenya's confidence is sky-high | News | By beating eleven-time Africa champions Angola... |
| 8 | 21/02/2021 | Four teams undefeated at the end of AfroBasket... | Review | MONASTIR/YAOUNDE (Tunisia/Cameroon) - The 20-t... |
| 9 | 21/02/2021 | Top performers at Day 3 in Yaounde | News | YAOUNDE (Cameroon) - There was great frenzy at... |
| 10 | 21/02/2021 | Top performers as curtains fall on FIBA AfroBa... | News | MONASTIR (Tunisia) - With prestige and honor a... |
| 11 | 21/02/2021 | Aristide Mouaha from mop boy to Cameroon inter... | News | YAOUNDE (Cameroon) - The game was in the third... |
| 12 | 21/02/2021 | Ongwae's buzzer-beater shocks Angola as Kenya ... | Game Report | YAOUNDE (Cameroon) On what was Kenya's biggest... |
| 13 | 20/02/2021 | Three tickets still available for AfroBasket 2021 | Review | On the day that Kenya caused the biggest upset... |
| 14 | 20/02/2021 | Romdhane show as Dokossi rises to summit | News | MONASTIR (Tunisia) - Of newbies and veterans e... |
| 15 | 20/02/2021 | Ongwae, Ndoye, Nzeulie and Obiekwe dazzle in Y... | News | Magical and electrifying may not come close to... |
| 16 | 20/02/2021 | Liz Mills trailblazing for more female coaches... | News | YAOUNDE (Cameroon) - A dream nursed in Sydney,... |
| 17 | 20/02/2021 | Iroegbu brothers excited to continue Nigerian ... | Long Read | MONASTIR (Tunisia) - Playing with your sibling... |
| 18 | 20/02/2021 | Cote d'Ivoire's Konate defying age in style | News | YAOUNDE (Cameroon) - Cote d'Ivoire's Stephane ... |
| 19 | 20/02/2021 | Kouguere magic as Central African Republic edg... | Game Report | MONASTIR (Tunisia) - Central African Republic ... |
| 20 | 19/02/2021 | Nshobozwabyosenumukiza, Diogu sign out on a high | News | MONASTIR (Tunisia) - Rwanda point guard Jean J... |
| 21 | 19/02/2021 | Morais, Diallo, Mansare and Thompson star in Y... | News | YAOUNDE (Cameroon) - There was fireworks at th... |
| 22 | 19/02/2021 | Rwanda pick first win as four more teams quali... | Review | MONASTIR/YAOUNDE (Tunisia/Cameroon) - Rwanda p... |
| 23 | 19/02/2021 | Angola's Leonel Paulo bringing his experience ... | News | YAOUNDE (Cameroon) - When you've played at fiv... |
| 24 | 19/02/2021 | FIBA Statement about the February FIBA AfroBas... | Statement | Following the Covid-19 Protocol for FIBA Offic... |
| 25 | 19/02/2021 | Luol Deng's South Sudan revel in FIBA AfroBask... | News | MONASTIR (Tunisia) - It has been a long journe... |
| 26 | 19/02/2021 | Senegal's Ndoye "We want to stay unbeaten" | News | YAOUNDE (Cameroon) - Whenever five-time Afroba... |
| 27 | 18/02/2021 | Kuany, Doucoure and Omoerah in cloud nine at F... | News | MONASTIR (Tunisia) - Kuany Ngor Kuany was at t... |
| 28 | 18/02/2021 | South Sudan, Mali qualify for AfroBasket 2021,... | Review | Day 2 of February's window of the FIBA AfroBas... |
| 29 | 18/02/2021 | Players to watch out for in FIBA AfroBasket 20... | News | YAOUNDE (Cameroon) - The third and final windo... |